This notebook demonstrates how to reproduce the results of our TPAMI paper on NLP tasks.
Caveat: the results may differ slightly from those in the published paper, which were obtained with Matlab code; this notebook uses a rewritten Python version. The Python version is the only one we distribute, as it is much cleaner and simpler to run than the Matlab version.
The data splits (folds) are specified in the data/datasplit.n_data=*.txt files. The random seeds are all 0, as can be seen in the command lines. The experiments were run on our lab's SGE computing cluster, named Fear. The SGE and Python command lines are generated by a Python program (src/fear.py), so the experimental configuration you see here is largely self-contained.
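A minimal sketch of the datasplit file layout, under the assumptions just described (one row per fold, 1-based instance indices, first half used for training); the path is relative to the repository root and the sizes follow the basenp task used below:

import numpy as np
# hypothetical sanity check of one datasplit file (basenp: n_data=300, 150 training instances)
n_data, n_data_train = 300, 150
splits = np.loadtxt('data/datasplit.n_data=%s.txt' % n_data, dtype=np.int16) - 1  # make 0-based
assert splits.shape[1] == n_data
assert all(sorted(row) == list(range(n_data)) for row in splits.tolist())  # each row is a permutation
train_idx, test_idx = splits[0, :n_data_train], splits[0, n_data_train:]   # fold 0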
In [3]:
    
pygpstruct_location = '/home/sb358/pygpstruct'
pygpstruct_fear_location = '/home/mlg/sb358/pygpstruct'
result_location = '/bigscratch/sb358/pygpstruct/results'
%load_ext autoreload
%autoreload 2
import sys
sys.path.append(pygpstruct_location + '/src/') # replace with your path to the .py files
import numpy as np
np.set_printoptions(precision=3)
import fear
    
In [9]:
    
for task in ['basenp', 'chunking', 'segmentation', 'japanesene']:
    n_data = {'basenp' : 300, 'chunking' : 100, 'segmentation' : 36, 'japanesene' : 100}[task]
    n_data_train = {'basenp' : 150, 'chunking' : 50, 'segmentation' : 20, 'japanesene' : 50}[task]
    files_prefix = result_location + '/2014-08-22_%s/' % task
    data_indices = np.loadtxt(pygpstruct_location + '/data/datasplit.n_data=%s.txt' % n_data, dtype=np.int16) - 1 # need -1 because doing +1 inside prepare_data_chain
    for fold in range(5):
        fear.launch_qsub_job({ 
            'n_samples' : '250000', 
            'prediction_thinning' : '1000', 
            'lhp_update' : "{'binary' : np.log(1)}",
            'data_indices_train' : 'np.array(%s)' % str(data_indices[fold,:n_data_train].tolist()),
            'data_indices_test' : 'np.array(%s)' % str(data_indices[fold,n_data_train:].tolist()), 
            'data_folder' : "'" + pygpstruct_fear_location + "/data/%s'" % task,
            'task' : "'%s'" % task
            },
            job_hash = 'qsub_' + str(fold), 
            files_prefix=files_prefix, 
            repeat_runs=8)
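A quick sanity check (a sketch reusing the data_indices and n_data_train variables from the last loop iteration) that each fold's training and test indices are disjoint before submitting jobs:

for fold in range(5):
    train = set(data_indices[fold, :n_data_train].tolist())
    test = set(data_indices[fold, n_data_train:].tolist())
    assert not train & test, 'fold %d: train/test indices overlap' % fold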
    
    
In [5]:
    
#!ssh fear qdel -u sb358
!ssh fear qstat
!date
    
    
In [1]:
    
!tail -n 1 /bigscratch/sb358/pygpstruct/results/2014-08-22_*/qsub_*.log
#!ls -l /bigscratch/sb358/pygpstruct/results/2014-08-22_basenp/*
    
    
In [35]:
    
# check state of a job
import pickle
with open("/bigscratch/sb358/pygpstruct/results/2014-08-22_japanesene/qsub_3.lhp_update=binary:np.log1++n_samples=250000++prediction_thinning=1000++task=japanesene.state.pickle", 'rb') as f:
    a=pickle.load(f, encoding='latin1')
print(a)
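If the unpickled state object is large, a structure-only inspection (a sketch that makes no assumptions about which fields the .state.pickle file contains) can be more readable than printing it wholesale:

if isinstance(a, dict):
    for key, value in a.items():
        print(key, type(value).__name__, getattr(value, 'shape', ''))
else:
    print(type(a).__name__, sorted(vars(a).keys()) if hasattr(a, '__dict__') else a)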
    
    
In [8]:
    
import util 
util.make_figure([3], 
                 [('segmentation', '/bigscratch/sb358/pygpstruct/results/2014-08-22_segmentation/*.results.bin' ),
                  ('chunking', '/bigscratch/sb358/pygpstruct/results/2014-08-22_chunking/*.results.bin' ),
                  ('japanesene', '/bigscratch/sb358/pygpstruct/results/2014-08-22_japanesene/*.results.bin' ),
                  ('basenp', '/bigscratch/sb358/pygpstruct/results/2014-08-22_basenp/*.results.bin' ),
                  ], top=0.15, bottom=0.04)
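To keep a copy of the figure outside the notebook, a standard matplotlib call can follow the cell above (assuming util.make_figure draws onto the current matplotlib figure; the output filename is illustrative):

import matplotlib.pyplot as plt
plt.savefig(result_location + '/2014-08-22_nlp_error_curves.pdf', bbox_inches='tight')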
    
    
    
In [104]:
    
# Matlab line to regenerate data splits: n_data=150;fold=1; rand('state', fold); r=randperm(n_data*2);save(sprintf('~/n_data=%s.fold=%s.mat', int2str(n_data), int2str(fold)), 'r')
import scipy.io
print(scipy.io.loadmat('/home/sb358/n_data=150.fold=1.mat'))
# convert to txt format
for n_data in [18, 50, 150]:
    a = np.empty((5, n_data*2), dtype=np.int16)
    for fold in range(1,6):
        a[fold-1, :] = scipy.io.loadmat('/home/sb358/n_data=%s.fold=%s.mat' % (str(n_data), str(fold)))['r']
    np.savetxt('/home/sb358/pygpstruct/data/datasplit.n_data=%s.txt' % str(n_data*2), a, fmt="%g")
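A round-trip check that the text files written above match the original permutations (paths as in the cell above, purely illustrative):

for n_data in [18, 50, 150]:
    b = np.loadtxt('/home/sb358/pygpstruct/data/datasplit.n_data=%s.txt' % str(n_data*2), dtype=np.int16)
    assert b.shape == (5, n_data*2)
    # each row should be a permutation of 1..n_data*2 (Matlab randperm is 1-based)
    assert all(sorted(row) == list(range(1, n_data*2 + 1)) for row in b.tolist())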
    